{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "## Overview\n",
    "\n",
    "- **Expectation Used:** [expect_profile_numeric_columns_diff_between_inclusive_threshold_range](https://github.com/great-expectations/great_expectations/blob/develop/contrib/capitalone_dataprofiler_expectations/capitalone_dataprofiler_expectations/expectations/expect_profile_numeric_columns_diff_between_inclusive_threshold_range.py)\n",
    "\n",
    "- **Expectation Description:** This expectation will check the user-specified difference between user-specified metrics from an original report generated by the Data Profiler and a new report. The two reports are generated on two different sets of data with matching schemas.\n",
    "\n",
    "- **Example Details:** In this example a data owner has a dataset containing salary information on individuals in the data science field ranging from **2020** to **2022**. The data owner has been tasked with checking the historical trends of salary growth across the industry with the numbers they have. They want to use this expectation to determine if the median growth of salaries at their company are within the same range of median growth as the companies they have data on."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### Imports"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "# Great expectations imports\n",
    "import great_expectations as ge\n",
    "from capitalone_dataprofiler_expectations.expectations. \\\n",
    "    expect_profile_numeric_columns_diff_between_inclusive_threshold_range \\\n",
    "    import ExpectProfileNumericColumnsDiffBetweenInclusiveThresholdRange\n",
    "from great_expectations.self_check.util import build_pandas_validator_with_data\n",
    "\n",
    "# Data Profiler import\n",
    "import dataprofiler as dp"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### Setup\n",
    "Below we are going to import a dataset from the Data Profile testing suite. This csv holds information on the salaries of individuals in the data science field from all over the world."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "context = ge.get_context()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "data_path = \"../../dataprofiler/tests/data/csv/ds_salaries.csv\"\n",
    "data = dp.Data(data_path).data\n",
    "data"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "For this expectation we are going to compare the diff median `salary_in_usd` between `work_year` **2020** and **2022**. Below we are gathering the different years that are recorded in this dataset."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "data.sort_values(by=\"work_year\", axis=0, inplace=True)\n",
    "years = data[\"work_year\"].unique().tolist()\n",
    "years"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Now that we have the years, we will capture all records from each year in their own dataframes so that we can process them separately."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "individual_dataframes = []\n",
    "for year in years:\n",
    "    current_year_df = data.loc[data[\"work_year\"]==year]\n",
    "    current_year_df = current_year_df.drop(\"work_year\", axis=1)\n",
    "    individual_dataframes.append(current_year_df)\n",
    "year_2020_salary_df = individual_dataframes[0]\n",
    "year_2022_salary_df = individual_dataframes[2]"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Now we will create a report on the first individual dataframe which corresponds to the year **2020**, then we will output the median `salary_in_usd` from this dataframe."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "profiler_options = dp.ProfilerOptions()\n",
    "profiler_options.set({\"data_labeler.is_enabled\": False})\n",
    "\n",
    "profile = dp.Profiler(year_2020_salary_df, len(year_2020_salary_df), options=profiler_options)\n",
    "profile.save(filepath='year_2020_salary_profile.pkl')\n",
    "report  = profile.report(report_options={\"output_format\": \"compact\"})"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Let's take a look at the output `median` metric from the `salary_in_usd` column as found in the report"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "report['data_stats'][6]['statistics']['median']"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "The data owner has been told by his company that median salary increases were about **10000** from **2020** to **2022**. So the data owner wants to for their range around this number. With the validator below, an expectation is set up with a difference between the `median` `salary_in_usd` of at least **7500** and no more than **12500**. Meaning, this validator will check if the `median` `salary_in_usd` has increased by **7500** to **12500** from **2020** to **2022** across all the companies they have data on."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "validator = build_pandas_validator_with_data(year_2022_salary_df)\n",
    "results = validator.expect_profile_numeric_columns_diff_between_inclusive_threshold_range(\n",
    "    profile_path='year_2020_salary_profile.pkl',\n",
    "    limit_check_report_keys={\n",
    "            \"salary_in_usd\": {\n",
    "                \"median\": {\"lower\": 7500, \"upper\": 12500},\n",
    "            },\n",
    "        }\n",
    ")"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### Results\n",
    "From the output below, the data owner can see that the expectation has an unexpected value. The result shows that the diff between the two profiles is significantly more than the upperbound at about **44000**. This indicates to the data owner that the salary growth trends in competing companies have increased more than his own company in the last two years."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "results"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
